JCO Clinical Cancer Informatics
● American Society of Clinical Oncology (ASCO)
Preprints posted in the last 90 days, ranked by how well they match the content profile of JCO Clinical Cancer Informatics, based on 14 papers previously published here. The average preprint has a 0.13% match score for this journal, so anything above that is an above-average fit.
Salome, P.; Knoll, M.; Walz, D.; Cogno, N.; Dedeoglu, A. S.; Qi, A. L.; Isakoff, S. J.; Abdollahi, A.; Jimenez, R. B.; Bitterman, D. S.; Paganetti, H.; Chamseddine, I.
Introduction: Manual data extraction from unstructured clinical notes is labor-intensive and impractical for large-scale clinical and research operations. Existing automated approaches typically require large language models, dedicated computational infrastructure, and/or task-specific fine-tuning that depends on curated data. The objective of this study is to enable accurate extraction with smaller, locally deployed models using a disease-site-specific pipeline and prompt configuration that are optimized and reusable. Materials/Methods: We developed OncoRAG, a four-phase pipeline that (1) generates feature-specific search terms via ontology enrichment, (2) constructs a clinical knowledge graph from notes using biomedical named entity recognition, (3) retrieves relevant context using graph-diffusion reranking, and (4) extracts features via structured prompts. We ran OncoRAG using Microsoft Phi-3-medium-instruct (14B parameters), a midsize language model deployed locally via Ollama. The pipeline was applied to three cohorts: triple-negative breast cancer (TNBC; n=104 patients, 42 features; primary development), recurrent high-grade glioma (RiCi; n=191 patients, 19 features; cross-lingual validation in German), and MIMIC-IV (n=100 patients, 10 features; external testing). Downstream task utility was assessed by comparing survival models for 3-year progression-free survival built from automatically extracted versus manually curated features. Results: The pipeline achieved mean F1 scores of 0.80 ± 0.07 (TNBC; n=44 patients, 42 features), 0.79 ± 0.12 (RiCi; n=61 patients, 19 features), and 0.84 ± 0.06 (MIMIC-IV; n=100 patients, 10 features) on test sets under the automatic configuration. Compared to direct LLM prompting and naive RAG baselines, OncoRAG improved the mean F1-score by 0.19-0.22 and 0.17-0.19, respectively. Manual configuration refinement further improved the F1-score to 0.83 (TNBC) and 0.81 (RiCi), with no change in MIMIC-IV.
Extraction time averaged 1.7-1.9 seconds per feature with the 14B model. Substituting a smaller 3.8B model reduced extraction time by 57%, with a 0.03-0.10 decrease in F1-score. For TNBC, extraction time fell from approximately two weeks of manual abstraction to under 2.5 hours. In an exploratory survival analysis, models using automatically extracted features showed a C-index comparable to those with manual curation (0.77 vs 0.76; 12 events). Conclusions: OncoRAG, deployed locally using a midsize language model, achieved accurate feature extraction from multilingual oncology notes without fine-tuning. It was validated against manual extraction for both retrieval accuracy and survival model development. This locally deployable approach, which requires no external data sharing, addresses a critical bottleneck in scalable oncology research.
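The abstract summarizes extraction quality as a mean F1 with a spread across features. As a minimal illustration of that style of summary (not the authors' pipeline or data), per-feature F1 can be computed from raw true-positive/false-positive/false-negative counts and then averaged; the counts below are hypothetical:

```python
from statistics import mean, pstdev

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; 0.0 when a feature was never predicted or present."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def summarize_features(counts):
    """Mean and population std of per-feature F1, the 'F1 0.80 +/- 0.07' style
    of reporting used in the abstract. `counts` maps feature -> (tp, fp, fn)."""
    scores = [f1_score(*c) for c in counts.values()]
    return mean(scores), pstdev(scores)

# Hypothetical per-feature (tp, fp, fn) counts for three extracted features.
counts = {"stage": (8, 2, 0), "grade": (9, 1, 1), "er_status": (10, 0, 0)}
mean_f1, sd_f1 = summarize_features(counts)
```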
Xu, S.; Wang, Z.; Wang, H.; Ding, Z.; Zou, Y.; Cao, Y.
Online cancer peer-support communities generate large volumes of patient-authored and caregiver-authored text that may reflect distress, coping, and informational needs. Automated emotional tone classification could support scalable monitoring, but supervised modeling depends on label quality and may benefit from explicit context features. Using the Mental Health Insights: Vulnerable Cancer Survivors & Caregivers dataset, we compared five model families (TF-IDF Logistic Regression, Random Forest, LightGBM, GRU, and fine-tuned ALBERT) on a three-class target (Negative/Neutral/Positive) derived from four original categories. We introduced two extensions: (i) LLM-based annotation to generate parallel "AI labels" and (ii) token-based augmentation that prepends LLM-extracted structured variables (reporter role and cancer type) to the post text. Models were trained with a 60/20/20 stratified train/validation/test split, with hyperparameters selected on validation data only. Test performance was summarized using weighted F1 and macro one-vs-rest AUC with bootstrap confidence intervals, with paired comparisons based on McNemar tests and false discovery rate adjustment. The LLM annotator produced substantial redistribution in the four-class label space, shifting prevalence toward very negative relative to the original labels; the shift persisted but attenuated after collapsing to three classes. Across all model families, token augmentation improved held-out performance, with the largest gains for GRU and consistent improvements for ALBERT. Augmentation also reduced polarity-reversing errors (Negative ↔ Positive) for ALBERT, while adjacent errors (Negative ↔ Neutral) remained the dominant residual failure mode.
These results indicate that LLM-based supervision can introduce systematic measurement shifts that require auditing, yet LLM-extracted context incorporated via simple token augmentation provides a pragmatic, model-agnostic mechanism to improve downstream emotional tone classification for supportive oncology decision support. Author summary: We studied how to better monitor emotional tone in posts from online cancer peer-support communities, where patients and caregivers share experiences that may signal distress, coping, or unmet needs. Automated classification could help organizations and moderators identify when additional support may be needed, but these systems depend on the quality of the labels used for training and may miss clinical context. Using a public dataset of cancer survivor and caregiver posts, we trained and compared several machine-learning and deep-learning models to classify each post as negative, neutral, or positive. We tested two practical improvements. First, we used a large language model to generate an additional set of "AI labels" and examined how these differed from the original categories. Second, we extracted simple context information--whether the writer was a patient or caregiver and what cancer type was mentioned--and added this context to the text before model training. We found that adding context consistently improved performance across model types. However, the AI-generated labels shifted class distributions, indicating that automated labeling can introduce systematic changes that should be audited. Overall, simple context extraction can make emotional tone monitoring more accurate and useful for supportive oncology decision support.
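The token-based augmentation described above amounts to prepending structured context as pseudo-tokens to each post before classification. A minimal sketch; the bracket-token format below is an assumption for illustration, not the paper's documented scheme:

```python
def augment_post(text, role=None, cancer_type=None):
    """Prepend LLM-extracted context as pseudo-tokens so any downstream text
    classifier (TF-IDF or transformer) can condition on reporter role and
    cancer type. Token format here is hypothetical."""
    tokens = []
    if role:
        tokens.append(f"[ROLE={role.lower()}]")
    if cancer_type:
        tokens.append(f"[CANCER={cancer_type.lower().replace(' ', '_')}]")
    return " ".join(tokens + [text])

example = augment_post("Scan results tomorrow, feeling anxious.",
                       role="Caregiver", cancer_type="Breast Cancer")
```

Because the context rides along as ordinary tokens, the same augmented strings feed every model family unchanged, which is what makes the mechanism model-agnostic.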
Jonnalagadda, P.; Obeng-Gyasi, S.; Stover, D. G.; Andersen, B. L.; Rahurkar, S.
Background: Many patients with triple-negative breast cancer (TNBC), particularly those who are older, Black, or insured by Medicaid, do not receive guideline-concordant treatment, despite its association with up to 4x higher survival. Early identification of patients at risk for rapid relapse may enable timely interventions and improve outcomes. This study applies machine learning (ML) to real-world data to predict risk of rapid relapse in TNBC. Methods: We trained various ML models (logistic regression, decision trees, random forests, XGBoost, naive Bayes, support vector machines) using National Cancer Database (NCDB) data and fine-tuned them using electronic health record (EHR) data from a cancer registry. Class imbalance was addressed using the synthetic minority oversampling technique (SMOTE). Model performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), receiver operating characteristic area under the curve (ROC AUC), accuracy, and F1 scores. Transfer learning, cross-validation, and threshold optimization were applied to enhance the ensemble model's performance on clinical data. Results: Initial models trained on NCDB data exhibited high NPV but low sensitivity and PPV. SMOTE and hyperparameter tuning produced modest improvements. External testing on EHR data from a cancer registry showed similar model performance. After applying transfer learning, cross-validation, and threshold optimization using the clinical data, the optimized ensemble model achieved a sensitivity of 0.87, specificity of 0.99, PPV of 0.90, NPV of 0.98, ROC AUC of 0.99, accuracy of 0.98, and F1-score of 0.88. This optimized model, leveraging readily available clinical data, demonstrated superior performance compared to the initial NCDB-trained models and those reported in extant literature.
Conclusions: Transfer learning and threshold optimization effectively adapted ML models trained on NCDB data to an independent real-world clinical dataset from a single site, producing a high-performing model for predicting rapid relapse in TNBC. This model, potentially translatable to Fast Healthcare Interoperability Resources (FHIR)-compatible workflows, represents a promising tool for identifying patients at high risk. Future work should include prospective external validation, evaluation of integration into clinical workflows, and implementation studies to determine whether the model improves care processes such as timely patient navigation and treatment planning. Author Summary: In this study, we set out to understand which patients with triple-negative breast cancer might experience a rapid return of their disease. Many people with this aggressive form of cancer do not receive the treatments that are known to improve survival, especially patients who are older, Black, or insured through public programs. Being able to identify those at highest risk early in their care could help health teams provide timely support and ensure that patients receive the treatments they need. To do this, we used information from a large national cancer database to build computer-based models that learn from patterns in patient data. We then refined these models using real medical records from a cancer center to make sure they worked well in everyday clinical settings. After adjusting and improving the models, we developed a tool that can correctly identify most patients who are likely to have a rapid return of their cancer. Our hope is that this type of tool could eventually be built into routine care and help guide timely follow-up, support services, and treatment planning. More testing in real clinical environments will be important to understand how well the tool improves care and outcomes for patients.
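Threshold optimization of the kind applied above typically scans candidate decision cutoffs on validation data and keeps the one maximizing a chosen metric, rather than defaulting to 0.5. A generic sketch on toy scores (not the study's models or data):

```python
def best_threshold(probs, labels, metric):
    """Scan every observed score as a candidate cutoff and keep the one that
    maximizes `metric` on validation data; `metric` maps (tp, fp, fn, tn) -> score."""
    best_t, best_s = 0.5, float("-inf")
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < t and y == 0)
        s = metric(tp, fp, fn, tn)
        if s > best_s:
            best_t, best_s = t, s
    return best_t, best_s

def f1(tp, fp, fn, tn):
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

# Toy validation scores: the default 0.5 cutoff would miss the positives at 0.3-0.4.
probs  = [0.1, 0.2, 0.3, 0.35, 0.4, 0.9]
labels = [0,   0,   1,   1,    1,   1]
t, s = best_threshold(probs, labels, f1)
```

On imbalanced problems like rapid relapse, lowering the cutoff this way is a standard route to recovering sensitivity that SMOTE alone does not provide.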
Makani, A.
Medical oncology education faces a dual crisis: knowledge velocity that outpaces static curricula and large language model (LLM) risks--hallucination and automation bias--that threaten the fidelity of AI-assisted learning. We present Onco-Shikshak V7, an AI-native adaptive learning platform that addresses both challenges through a unified cognitive architecture grounded in learning science. The system replaces isolated educational modules with four authentic clinical workflows--Morning Report, Tumor Board, Clinic Day, and AI Textbook--each scaffolded by a nine-module pedagogy engine that integrates ACT-R activation dynamics (illness scripts), Item Response Theory (adaptive difficulty), the Free Spaced Repetition Scheduler (FSRS v4), Zone of Proximal Development (scaffolding), and metacognitive calibration training (Brier score). Six specialist AI agents--medical oncology, radiation oncology, surgical oncology, pathology, radiology, and oncology navigation--engage in multi-disciplinary deliberation with per-specialty retrieval-augmented generation (RAG) grounding across nine authoritative guideline sources including NCCN, ESMO, and ASTRO. The platform provides 18 clinical cases with decision trees across six cancer types, maps every interaction to 13 ACGME Hematology-Oncology milestones, and implements four closed-loop feedback mechanisms that connect session errors to targeted flashcards, weak domains to suggested cases, and all interactions to a persistent learner profile. Technical validation confirms algorithmic correctness across eight subsystems. To our knowledge, this is the first system to unify ACT-R, IRT, FSRS, ZPD, and metacognitive calibration in a single medical education platform. Formal learner evaluation via randomized controlled trial is planned.
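The Brier score used above for metacognitive calibration training is the mean squared gap between a learner's stated confidence and the binary outcome. A minimal illustration with hypothetical confidence ratings:

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between stated confidence (0-1) and the 0/1
    outcome; 0 is perfect, and always answering 0.5 scores exactly 0.25."""
    assert len(confidences) == len(outcomes)
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

# A learner who says "80% sure" on four items and is right on three of them:
score = brier_score([0.8, 0.8, 0.8, 0.8], [1, 1, 1, 0])
```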
Dennstaedt, F.; Bobnar, T.; Handra, A.; Putora, P. M.; Filchenko, I.; Brueningk, S.; Aebersold, D. M.; Cihoric, N.; Shelan, M.
Background: The growing volume of biomedical literature, especially in oncology, necessitates automated tools for extracting clinically relevant information. Large Language Models (LLMs) offer promising capabilities for data extraction in this domain. However, their potential to extract clinically relevant information from case reports detailing rare treatment interactions remains underexplored. Methods: We systematically searched PubMed for case reports on interactions between radiotherapy (RT) and Pembrolizumab, Cetuximab, or Cisplatin. A random sample of 100 report abstracts for each therapy was manually classified by two independent medical experts using 17 Boolean questions about patient demographics, treatment, cancer type, and outcome with mutually exclusive answers, forming a ground truth. An LLM-based system with the open-source GPT models (GPT-OSS-120B and GPT-OSS-20B) was applied to classify these reports and the remaining dataset entries using the defined question structure. Performance of the LLM-based information extraction was evaluated using the standard classification metrics accuracy, precision, recall, and F1-score. Results: The systematic searches yielded 320 (Pembrolizumab), 147 (Cetuximab), and 2055 (Cisplatin) publications. Inter-rater agreement for manual classification was high (Cohen's kappa = 0.87), though lower (0.60-0.80) for specific outcome and cancer type questions. The LLM-based classification (GPT-OSS-120B model) achieved high overall performance with an F1-score of 94.33% (95.83% accuracy, 93.69% precision, 94.98% recall). Performance was consistent across systemic therapies, with the smaller GPT-OSS-20B model showing similar results (F1-score 94.06%). Analysis of the entire datasets revealed that 56.02% of publications described patients who received both RT and systemic therapy. Proportions of positive and negative outcomes varied by therapy and sequencing.
Conclusions: LLM-based classification systems demonstrate high accuracy and reliability for curating scientific case reports on RT and systemic therapy interactions. These findings support their potential for high-throughput hypothesis generation and knowledge base construction in oncology, particularly for underutilized case reports, with even smaller open-source models proving effective for such tasks.
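Cohen's kappa, used above for inter-rater agreement, corrects observed agreement for the agreement expected by chance given each rater's label frequencies. A small self-contained sketch with hypothetical ratings:

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters over categorical labels:
    (observed - expected) / (1 - expected)."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    cats = set(a) | set(b)
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)  # chance agreement
    return (po - pe) / (1 - pe)

# Two raters answering six Boolean questions; they disagree on one.
k = cohens_kappa([1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 0, 0])
```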
Prelaj, A.; Miskovic, V.; Sacco, M.; Ferrarin, A.; Licciardello, C.; Provenzano, L.; Favali, M.; Lerma, L.; Zec, A.; Spagnoletti, A.; Ganzinelli, M.; Lorenzini, D.; Guirges, B.; Invernizzi, L.; Silvestri, C.; Mazzeo, L.; Meazza Prina, M.; Corrao, G.; Ruggirello, M.; Dumitrascu, A. D.; Di Mauro, R. M.; Monzani, D.; Pravettoni, G.; Zanitti, M.; Macocchi, D.; Marino, M.; Cavalli, C.; Romano, R.; Giani, C.; Armato, S. G.; Esposito, A.; Bestvina, C.; Spector, M.; Bogot, N. R.; Basheer, R.; Hafzadi, A. L.; Roisman, L.; Watermann, I.; Szewczyk, M.; Olchers, T.; Richter, H.; Blanke-Roeser, C.; Sinisca
Despite a decade of immunotherapy, treatment selection in non-small cell lung cancer (NSCLC) still relies on subgroup analyses and clinical scores. I3LUNG (NCT05537922) is currently the largest international, real-world, multimodal, artificial intelligence (AI)-based trial, enrolling 2365 patients. We integrated real-world clinical data (RWD), computed tomography (CT) images, digital pathology (DP), and genomics (G) into machine learning early-fusion (MLEF) and deep-learning intermediate-fusion (DLIF) models. MLEF achieved consistent performance across outcomes (AUC ≈ 0.74), with improved results in first-line patients (AUC up to 0.82). Multimodal models outperformed RWD in clinical-specific subgroups (AUCs up to 0.86). In the test set, AI models surpassed PD-L1, ECOG PS, NLR, LDH (all with p<0.01) and the LIPI score. The clinical usability study showed that expert and non-expert physicians could improve their predictions with the explainable AI (XAI) tool. The I3LUNG tool emerges as a clinically relevant decision-support system and is currently under prospective validation in >2,000 patients.
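Early fusion as in MLEF conventionally means concatenating per-modality feature vectors into one input before a single classifier is trained; the per-patient features below are hypothetical placeholders, not I3LUNG variables:

```python
def early_fusion(modalities):
    """Early fusion: concatenate per-modality feature vectors into one flat
    input vector for a single downstream classifier."""
    fused = []
    for name in sorted(modalities):   # fixed modality order keeps columns stable
        fused.extend(modalities[name])
    return fused

# Hypothetical per-patient features from clinical data, CT, pathology, genomics.
patient = {
    "clinical": [0.64, 1.0],     # e.g. scaled age, ECOG PS
    "ct": [0.12, 0.55, 0.33],    # image-derived features
    "genomics": [0, 1, 0],       # mutation indicators
    "pathology": [0.8],          # slide-level score
}
x = early_fusion(patient)
```

Intermediate fusion, by contrast, merges learned per-modality embeddings inside the network rather than raw features at the input.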
Passweg, L. P.; Schwenke, J. M.; Schoenenberger, C. M.; Locher, F.; Picker, J.; Dieterle, M.; Thiele, B.; Hasler, D.; Danelli, A.; Schmitt, A. M.; Heye, T.; Stojanov, T.; Briel, M.; Kasenda, B.
Importance: Manual data extraction from clinical text is resource intensive. Locally hosted large language models (LLMs) may offer a privacy-preserving solution, but their performance on non-English data remains unclear. Objective: To investigate whether the classification accuracy of locally hosted LLMs is non-inferior to human accuracy when determining metastasis status and treatment response from German radiology reports. Design: In this retrospective comparative accuracy study, five locally hosted LLMs (llama3.3:70b, mistral-small:24b, qwq:32b, qwen3:32b, and gpt-oss:120b) were compared against humans. To calculate accuracy, a ground truth was established via duplicate human extraction and adjudication of discrepancies by a senior oncologist. Both initial human extraction and LLM outputs were compared against this ground truth. Setting: The study was conducted at a tertiary referral hospital in Switzerland; data processing and analyses took place inside the hospital network. Participants: 400 randomly sampled radiology reports from adult cancer patients (CT, MRI, PET) generated between January 2023 and May 2025. Exposures: Automated classification of metastasis status and treatment response by LLMs using a standardized prompt pipeline, compared to manual human review. Main Outcomes and Measures: Primary outcomes were non-inferiority (5 percentage point [pp] margin) of LLM classification accuracy compared with human accuracy for metastasis status (presence/absence by anatomical site) and treatment response categories. Secondary outcomes included accuracy for primary tumor diagnosis, radiological absence of tumor, and extraction time per report. Results: The analysis included 400 reports from 317 patients (mean age 63 years, 32% women). On the test set (n=300), human accuracy for metastasis status was 98.4% (95% CI 98.0%-98.8%). All LLMs were non-inferior; gpt-oss:120b performed best (97.6% accuracy; difference, -0.8 pp [90% CI, -1.3 to -0.3 pp]).
For response to treatment, human accuracy was 86.0% (95% CI 83.2%-88.8%). All LLMs were inferior; the most accurate model, gpt-oss:120b, achieved 78.3% (difference -7.7 pp [90% CI, -11.6 to -3.8 pp]). Mean human time per report was 120 seconds vs 11-63 seconds for LLMs. Conclusion and Relevance: In this study, LLMs were non-inferior to human accuracy for classification of metastasis status but were inferior for response-to-treatment assessment. gpt-oss:120b was the most accurate among tested LLMs. Study Registration: OSF 45PVQ. Key Points. Question: Can locally hosted large language models (LLMs) match human performance when extracting sites of metastases and response to treatment from radiology reports of cancer patients? Findings: In this preregistered, single-center study of 300 German radiology reports, all evaluated LLMs were non-inferior to humans in extracting the presence or absence of metastasis by organ site, but LLMs were inferior to humans in classification of response to treatment. Meaning: LLMs can be suitable for classification of metastasis status, whereas more caution is warranted for more complex tasks where additional clinical reasoning may be required.
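The non-inferiority logic in this design reduces to checking whether the lower confidence bound of the accuracy difference (LLM minus human, in percentage points) stays above the prespecified -5 pp margin. A sketch using the point estimates and bounds reported above:

```python
def non_inferior(diff_pp, ci_lower_pp, margin_pp=5.0):
    """Declare non-inferiority when the lower confidence bound of the accuracy
    difference (model - human, percentage points) stays above -margin."""
    assert ci_lower_pp <= diff_pp      # a lower bound cannot exceed its estimate
    return ci_lower_pp > -margin_pp

# Values shaped like the abstract's gpt-oss:120b results.
metastasis = non_inferior(diff_pp=-0.8, ci_lower_pp=-1.3)   # bound well inside -5 pp
response = non_inferior(diff_pp=-7.7, ci_lower_pp=-11.6)    # bound crosses the margin
```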
Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.
Purpose: To quantify run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for biomedical trial-success classification across temperature and reasoning/thinking settings, and to assess whether single-run reporting is sufficient. Methods: We utilized 250 randomized controlled oncology trial abstracts labeled POSITIVE/NEGATIVE based on primary endpoint success. With a fixed prompt requiring exactly "POSITIVE" or "NEGATIVE", we evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures 0.0-2.0, and GPT-5.2 across reasoning-effort levels (none to xhigh) with an additional temperature sweep when reasoning was disabled. Each setting was run three times. Reproducibility was quantified with Fleiss' κ across replicates, performance was summarized with F1 (per run and majority vote), and invalid-format outputs were recorded. Results: Gemini showed near-perfect agreement across settings (κ = 0.942-1.000), including perfect agreement at temperature 0. Invalid outputs were uncommon (0-1.5%). GPT-5.2 reproducibility was similarly high (κ = 0.984-0.995) with no invalid outputs. Performance remained stable (mean/majority-vote F1 = 0.955-0.971), and majority voting offered only marginal gains. Conclusion: For strict binary biomedical classification with tightly constrained outputs, both models were highly reproducible across common decoding and reasoning configurations, indicating that one run is often adequate while minimal replication provides a practical stability check.
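Fleiss' κ generalizes chance-corrected agreement to more than two raters, here the three replicate runs per setting. A self-contained sketch with hypothetical run outputs:

```python
def fleiss_kappa(runs):
    """Fleiss' kappa across replicate runs. `runs` is a list of equal-length
    label sequences, one per run; every item is rated by every run."""
    n = len(runs)                  # raters (replicates) per item
    items = list(zip(*runs))       # per-item tuples of labels across runs
    N = len(items)
    cats = sorted({label for run in runs for label in run})
    # overall category proportions across all N*n ratings
    p = {c: sum(it.count(c) for it in items) / (N * n) for c in cats}
    p_bar_e = sum(v * v for v in p.values())
    # mean per-item agreement
    p_bar = sum(
        (sum(it.count(c) ** 2 for c in cats) - n) / (n * (n - 1)) for it in items
    ) / N
    return (p_bar - p_bar_e) / (1 - p_bar_e)

# Three replicate runs over five abstracts; run 3 flips one label.
runs = [
    ["POS", "NEG", "POS", "NEG", "POS"],
    ["POS", "NEG", "POS", "NEG", "POS"],
    ["POS", "NEG", "NEG", "NEG", "POS"],
]
kappa = fleiss_kappa(runs)
```

With identical replicates κ is exactly 1, matching the "perfect agreement at temperature 0" result above.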
Cai, L.; Zhang, T.; Beets-Tan, R.; Brunekreef, J.; Teuwen, J.
The use of Electronic Health Records (EHRs) has increased significantly in recent years. However, a substantial portion of the clinical data remains in unstructured text formats, especially in the context of radiology. This limits the application of EHRs for automated analysis in oncology research. Pretrained language models have been utilized to extract feature embeddings from these reports for downstream clinical applications, such as treatment response and survival prediction. However, a thorough investigation into which pretrained models produce the most effective features for rectal cancer survival prediction has not yet been conducted. This study explores the performance of five Dutch pretrained language models, including two publicly available models (RobBERT and MedRoBERTa.nl) and three developed in-house for this study (RecRoBERT, BRecRoBERT, and BRec2RoBERT), each trained on a distinct Dutch-only corpus, in predicting overall survival and disease-free survival outcomes in rectal cancer patients. Our results showed that our in-house developed BRecRoBERT, a RoBERTa-based language model trained from scratch on a combination of Dutch breast and rectal cancer corpora, delivered the best predictive performance for both survival tasks, achieving a C-index of 0.65 (0.57, 0.73) for overall survival and 0.71 (0.64, 0.78) for disease-free survival. It outperformed models trained on general Dutch corpora (RobBERT) or Dutch hospital clinical notes (MedRoBERTa.nl). BRecRoBERT demonstrated the potential capability to predict survival in rectal cancer patients using Dutch radiology reports at diagnosis. This study highlights the value of pretrained language models that incorporate domain-specific knowledge for downstream clinical applications. Furthermore, it shows that utilizing data from related domains can improve the quality of feature embeddings for certain clinical tasks, particularly in situations where domain-specific data is scarce.
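The C-index reported for both survival tasks is Harrell's concordance: among usable patient pairs, the fraction where the patient who fails earlier carries the higher predicted risk. A minimal sketch on toy data (not the study's cohort):

```python
def c_index(times, events, risks):
    """Harrell's concordance. A pair (i, j) is comparable when i has an event
    and fails before j's observed time; it is concordant when i's predicted
    risk is higher. Risk ties count 1/2."""
    conc = comp = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:   # i failed while j was still at risk
                comp += 1
                if risks[i] > risks[j]:
                    conc += 1
                elif risks[i] == risks[j]:
                    conc += 0.5
    return conc / comp

# Toy cohort: follow-up in months, event flags (1 = death/progression), risk scores.
times  = [6, 12, 18, 24]
events = [1, 1, 0, 1]
risks  = [0.9, 0.7, 0.4, 0.2]
c = c_index(times, events, risks)
```

A value of 0.5 means the risk scores order failures no better than chance; the 0.65-0.71 range reported above sits in the usual territory for clinical text-based survival models.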
Bonetti, A.; Le, V.-L.; Carrero, Z. I.; Wolf, F.; Gustav, M.; Lam, S. W.; Vanhersecke, L.; Sobczuk, P.; LE LOARER, F.; Lenarcik, M.; Rutkowski, P.; van Sabben, J. M.; Steeghs, N.; van Boven, H.; Machado, I.; Bague, S.; Navarro, S.; Medina-Ceballos, E.; Agra, C.; Giner, F.; Tapia, G.; Hernandez Gallego, A.; Civantos Jubera, G.; Cuatrecasas, M.; Lopez-Prades, S.; Perret, R. E.; Soubeyran, I.; Khalifa, E.; Blouin, L.; Wardelmann, E.; Meurgey, A.; Collini, P.; Voloshin, A.; Yatabe, Y.; Hirano, H.; Gronchi, A.; Nishida, T.; Bouche, O.; Emile, J.-F.; NGO, C.; Hohenberger, P.; Cotarelo, C.; Jakob, J.
Background: Gastrointestinal stromal tumor (GIST) is the most common gastrointestinal mesenchymal tumor, driven by tyrosine-protein kinase KIT and platelet-derived growth factor receptor A (PDGFRA) mutations. Specific variants, such as KIT exon 11 deletions, carry prognostic and therapeutic implications, whereas wild-type (WT) variants derive limited benefit from tyrosine kinase inhibitors (TKIs). Given the limited reproducibility of established clinicopathological risk models, deep learning (DL) applied to whole-slide images (WSIs) has emerged as a promising tool for molecular classification and prognostic assessment. Patients and Methods: We analyzed 8398 GIST cases from 21 centers in 7 countries, including 7238 with molecular data and 2638 with clinical follow-up. DL models were trained on WSIs to predict mutations, treatment sensitivity, and recurrence-free survival (RFS). Results: DL predicted mutational status in GIST from WSIs, with an area under the curve (AUC) of 0.87 for KIT and 0.96 for PDGFRA. High performance was observed for subtypes, including KIT exon 11 delins 557-558 (0.67) and PDGFRA exon 18 D842V (0.93). For therapeutic categories, performance reached 0.84 for avapritinib sensitivity and 0.81 for imatinib sensitivity. DL models predicted RFS, with hazard ratios (HR) of 8.44 (95% CI 6.14-11.61) in the overall cohort and 4.74 (95% CI 3.34-6.74) in patients receiving adjuvant therapy. Prognostic performance was comparable to pathology-based scores, with the highest discrimination in the overall cohort and in patients without adjuvant therapy (HR 9.44, 95% CI 5.87-15.20). Conclusion: DL applied to WSIs enables prediction of molecular alterations, treatment sensitivity, and RFS in GIST, performing comparably to established risk scores across international cohorts and providing a baseline for future multimodal predictors.
Highlights:
- Deep learning on histology predicts KIT and PDGFRA mutations in a large international cohort of GISTs from multiple centers
- Whole-slide image models stratify recurrence-free survival comparably to pathology-based risk scores
- Prognostic value of deep learning is preserved in adjuvant therapy subgroups, supporting treatment duration decisions
Graphical abstract: Overview of study design and dataset characteristics. (A) Multinational collection of WSIs from seven countries (Spain, France, Italy, Germany, the Netherlands, Poland, and Japan), followed by standard image preprocessing with the STAMP pipeline and clinical data preprocessing/standardization via the Grammar Data Curation framework. The workflow was divided into two main branches: (i) molecular mutation and treatment sensitivity prediction, and (ii) RFS prediction. Model performance was evaluated using AUROC and F1 score for classification tasks, and Kaplan-Meier survival curves with hazard ratios for RFS. Model explainability was assessed through heatmaps of WSIs and identification of top predictive tiles. (B) Summary of clinical dataset composition: proportion of cases receiving adjuvant therapy, tumor location distribution, mutation distribution at the exon level, and mutation distribution at the codon level.
Gallifant, J.; Chen, S.; Shin, K.-Y.; Kellogg, K. C.; Doyle, P. F.; Guo, J.; Ye, B.; Warrington, A.; Zhai, B. K.; Hadfield, M. J.; Gusev, A.; Ricciuti, B.; Christiani, D. C.; Aerts, H. J.; Kann, B. H.; Mak, R. H.; Nelson, T. L.; Nguyen, P.; Schoenfeld, J. D.; Topaloglu, U.; Catalano, P.; Hochheiser, H. H.; Warner, J. L.; Sharon, E.; Kozono, D. E.; Savova, G. K.; Bitterman, D.
Immune-related adverse events (irAEs) affect up to 40% of patients receiving immune checkpoint inhibitors, yet their identification depends on laborious and inconsistent manual chart review. Here we developed and evaluated an agentic large language model system to extract the presence, temporality, severity grade, attribution, and certainty of six irAE types from clinical notes. Retrospectively (263 notes), the system achieved macro-averaged F1 of 0.92 for detection and 0.66 for multi-class severity grading; self-consistency improved F1 by 0.14. The best-performing configuration cost approximately $0.02 per note. In prospective silent deployment over three months (884 notes), detection F1 was 0.72-0.79. In a randomized crossover study of clinical trial staff (17 participants, 316 observations), agentic assistance reduced annotation time by 40% (P < 0.001), increased complete-match accuracy (OR 1.45; 95% CI 1.01-2.09; P = 0.045), and improved inter-annotator agreement (Krippendorff's α from 0.22-0.51 to 0.82-0.85). These results demonstrate that agentic AI coupled with human verification could enhance efficiency, performance, and consistency for irAE assessment.
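Self-consistency of the kind credited with the +0.14 F1 gain typically means sampling the same extraction several times and majority-voting the answers. A minimal sketch with hypothetical run outputs (the authors' exact aggregation scheme is not specified here):

```python
from collections import Counter

def self_consistent(runs):
    """Majority vote over repeated LLM extractions of the same field, returning
    the winning answer and the fraction of runs that agreed with it."""
    vote, _ = Counter(runs).most_common(1)[0]
    agreement = runs.count(vote) / len(runs)
    return vote, agreement

# Five hypothetical runs grading the same irAE note.
vote, agreement = self_consistent(["grade 2", "grade 2", "grade 3", "grade 2", "grade 2"])
```

The agreement fraction doubles as a cheap confidence signal for routing low-agreement notes to human verification.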
Vellanki, S.; Feiszt, P.; Kenny, P. A.
Standard pathology workup sometimes fails to definitively identify tumor tissue-of-origin in cancers with ambiguous diagnoses or unknown primary sites, complicating treatment decisions. Molecular assays can aid diagnosis but require additional tissue and increase healthcare costs. To leverage routinely collected somatic mutation profiles from comprehensive genomic profiling, we developed Tumor-Origin.com, a machine learning platform to predict tumor tissue-of-origin from mutation data alone. We trained five classifiers on 10,945 tumor mutation profiles from the MSK-IMPACT cohort and validated performance on an independent set of 770 tumors from the Gundersen Precision Oncology cohort spanning 52 cancer types. Performance was strongest for the most common tumor types, reflecting their relative over-representation in the training data. Among cancer types with more than five cases, the Logistic Regression classifier achieved the highest average top-3 accuracy of 49%, followed by the Support Vector Machine at 43%. At least one algorithm delivered ≥40% accuracy in 23 cancer types. Our integrated platform thus provides robust tumor origin predictions across diverse cancers. We have implemented a web-based tool (https://tumor-origin.com) to assist clinicians and researchers in refining diagnoses of cancers of unknown primary without requiring additional tissue or costly testing.
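Top-3 accuracy as reported here counts a case correct when the true tissue-of-origin appears anywhere in the model's three highest-ranked predictions. A sketch with hypothetical ranked calls:

```python
def top_k_accuracy(ranked_predictions, truths, k=3):
    """Fraction of cases whose true label appears in the model's top-k
    ranked guesses."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_predictions, truths))
    return hits / len(truths)

# Hypothetical ranked tissue-of-origin calls for four tumors.
preds = [
    ["lung", "breast", "colon"],
    ["pancreas", "stomach", "bile duct"],
    ["melanoma", "lung", "sarcoma"],
    ["prostate", "bladder", "kidney"],
]
truths = ["breast", "bile duct", "kidney", "prostate"]
acc = top_k_accuracy(preds, truths, k=3)
```

Reporting top-3 rather than top-1 reflects how such a tool would be used: narrowing a differential diagnosis rather than issuing a single verdict.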
Shady, M.; Reardon, B.; Jiang, S.; Pimenta, E.; O'Meara, T.; Park, J.; Kehl, K. L.; Elmarakeby, H. A.; Sunyaev, S. R.; Van Allen, E. M.
Introduction: Precision oncology has informed cancer care by enabling the discovery and application of diagnostic, prognostic, and/or predictive molecular biomarkers. However, many patients lack actionable biomarkers or fail to respond to biomarker-directed therapies. Patient similarity approaches can leverage comprehensive tumor profiling and prior clinical experiences from large cohorts for decision support, facilitating broader realization of precision oncology insights. Methods: We developed a deep learning-based modeling framework using real-world clinicogenomic data from a tertiary cancer center to (i) measure patient similarity based on embedded tumor genomic profiles and (ii) evaluate the association of derived patient subgroups and neighborhoods with shared therapeutic outcomes in breast cancer-specific and histology-agnostic pan-cancer settings. Results: The model recovered clinically meaningful patient clusters reflecting both expected and previously unknown therapeutic associations, as well as patient-specific neighborhoods that could inform therapeutic trajectories more often than expected by chance in multiple clinical contexts. Moreover, model utility extended to patients without actionable genomic biomarkers and those with cancer of unknown primary (CUP) diagnoses, where neighborhoods aligned with independently predicted primary cancer type. These neighborhoods could also be examined over time in a continuously learning scenario. Conclusion: This similarity-based modeling framework distilled complex molecular and clinical data into concise, context-specific insights that augment clinician judgment, providing a foundation for a real-time learning, patient-centered decision support model in precision oncology.
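Patient-similarity neighborhoods of this kind are commonly built by ranking cohort members by embedding similarity to a query patient. A minimal cosine-similarity sketch with hypothetical 3-d embeddings (not the authors' learned representations):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def nearest_neighbors(query, cohort, k=2):
    """Rank cohort patients by embedding similarity to the query patient
    and return the k most similar patient IDs."""
    ranked = sorted(cohort, key=lambda pid: cosine(query, cohort[pid]), reverse=True)
    return ranked[:k]

# Hypothetical genomic embeddings for a query patient and three cohort members.
cohort = {"pt_a": [1.0, 0.1, 0.0], "pt_b": [0.0, 1.0, 0.2], "pt_c": [0.9, 0.2, 0.1]}
neighbors = nearest_neighbors([1.0, 0.0, 0.0], cohort, k=2)
```

Once the neighborhood is formed, the therapeutic histories of those neighbors supply the context-specific evidence the conclusion describes.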
Veney, D. J.; Wei, L.; Miller, J. R.; Toland, A. E.; Presley, C. J.; Hampel, H.; Padamsee, T.; Bishop, M. J.; Kim, J. J.; Hovick, S. R.; Irvin, W. J.; Senter, L.; Stover, D.
Purpose: Tumor genomic testing (TGT) is standard-of-care for most patients with advanced/metastatic cancer. Despite established guidelines, patient education prior to TGT is frequently omitted. The purpose of this study was to evaluate the impact and durability of a concise 3-4 minute video for patient education prior to TGT in community versus academic sites and across cancer types. Patients and Methods: Patients undergoing standard-of-care TGT were enrolled at a tertiary academic institution in three cohorts: Cohort 1, breast cancer; Cohort 2, lung cancer; Cohort 3, other cancers. Cohort 4 consisted of patients with any cancer type similarly undergoing standard-of-care TGT at one of three community cancer centers. Participants completed survey measures prior to video viewing (T1), immediately post-viewing (T2), and after return of TGT results (T3). Outcome measures included: 1) a 10-question objective genomic knowledge/understanding (GKU) measure; 2) a 10-question video message-specific knowledge (VMSK) measure; 3) an 11-question Trust in Physician/Provider (TIPP) scale; 4) perceptions regarding TGT. Results: A total of 203 participants completed all survey timepoints. Higher baseline GKU and VMSK scores were significantly associated with higher income and greater years of education. For the primary objective, there was a significant and sustained improvement in VMSK across T1, T2, and T3 (overall p<0.0001), with no significant change in GKU (p=0.41) or TIPP (p=0.73). This trend was consistent within each cohort (all p≤0.0001). Results for four VMSK questions significantly improved, including those addressing impact on treatment decisions, incidental germline findings, and insurance coverage of testing. Conclusions: A concise, 3-4 minute, broadly applicable educational video administered prior to TGT significantly and sustainably improved video message-specific knowledge across diverse cancer types and in both academic and community settings.
This resource is publicly available at http://www.tumor-testing.com, with the goal of efficiently educating and empowering patients regarding TGT while addressing guidelines within the flow of clinical practice.
He, Y.; Almadani, H.; Huang, S.; Monzy, J.; Li, D.; Ray, E.; Huang, X.
Implant-based breast reconstruction is the most common surgical option following mastectomy for breast cancer. Despite its prevalence, up to one-third of patients develop complications within two years. Existing machine-learning models for predicting these complications rely solely on structured clinical data, overlooking prognostic information in narrative pathology reports. Recent advances in large language models (LLMs) enable extraction of numeric and semantic information from clinical text, offering opportunities to improve predictive performance and interpretability. We developed a fully on-premises, open-source model that extracts numeric morphometrics and contextual embeddings from free-text pathology reports and fuses them with 63 structured variables via a CLIP-style dual encoder. In a single-center cohort of 963 patients (Jan 2007-Jan 2022), the multimodal model improved composite-complication risk discrimination (AUROC 0.691 with logistic regression, 0.740 with clinical features, and 0.764 with the addition of pathology report text features; p=0.027) and enhanced sensitivity and positive predictive value at clinical thresholds. The automated extraction module we developed for numeric morphometrics (e.g., mastectomy-specimen weight) from free-text pathology reports achieved an accuracy of 96.3%. SHAP analyses confirmed established risk factors (expander-to-implant interval, body-mass index, and total mastectomy weight) as dominant drivers. In subgroup analyses, model performance remained robust, with particularly strong discrimination in specific populations; for example, among patients with a shorter expander-to-implant interval (EII), the AUROC reached 0.796 and accuracy was 0.799. These results show that on-premises, open-source LLMs reliably extract and fuse textual and structured clinical features to achieve clinically meaningful gains in predicting complications after implant-based breast reconstruction.
While traditional models are constrained by the limited scope of structured variables, pathology text, when analyzed with modern language models, adds new, clinically relevant signals. Even modest statistical gains yield more accurate identification of high-risk patients, potentially informing surgical planning, patient counseling, and postoperative follow-up. These findings demonstrate that privacy-preserving language models, when integrated with contrastive multimodal alignment, can unlock prognostic information embedded in narrative pathology reports and enable interpretable, patient-level decision support. This interpretable, privacy-preserving multimodal framework offers a generalizable approach for enhancing risk prediction and clinical decision-making across surgical oncology. Significance Statement: This study demonstrates that multimodal fusion of pathology free-text and structured clinical data, enabled by on-premises large language models, improves complication-risk discrimination after implant-based breast reconstruction. The approach uncovers prognostic signals hidden in narrative reports, offering an interpretable, privacy-preserving framework for precision risk prediction in surgical oncology.
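For readers unfamiliar with the AUROC figures quoted above, the metric can be estimated with the standard rank-based (Mann-Whitney) formula; the labels and scores below are invented for illustration and unrelated to the study's models:

```python
# Rank-based AUROC estimate: probability that a randomly chosen positive
# outscores a randomly chosen negative, with ties counted as half.
# All data here are synthetic toy values.
def auroc(y_true, scores):
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 1, 0, 0, 0]                       # 1 = complication, 0 = none
clinical_only = [0.7, 0.4, 0.5, 0.3, 0.2]  # toy risk scores, structured data only
with_text = [0.8, 0.6, 0.5, 0.3, 0.2]      # toy risk scores after adding text features
print(auroc(y, clinical_only), auroc(y, with_text))
```

A gain like the paper's 0.740 to 0.764 corresponds to the fused model ranking true complication cases above non-cases slightly more often.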
Abdollahyan, M.; BCNB-BCI; Chelala, C.
Common data models (CDMs) are essential for health data standardisation, which facilitates the governance and management of data, improves data quality and enhances the findability, accessibility, interoperability and reusability of data. They allow researchers to efficiently integrate health datasets and perform joint analyses on them, promoting collaboration and maximising the translation of research outputs for patients' benefit. We describe the process of transforming the biobank data for over 2,850 donors recruited at the Barts Cancer Institute (BCI) site of the Breast Cancer Now Biobank (BCNB) - the UK's first national breast cancer biobank hosting longitudinal biospecimens and associated clinical, genomic and imaging data - into the Observational Medical Outcomes Partnership (OMOP) CDM. Our transformation pipeline achieved high coverage, with 83% of source concepts mapped, and our OMOP CDM achieved a total pass rate of 100% in quality assessments. We present the breast cancer characteristics of the resultant patient cohort. We report several challenges faced during the transformation process, explain how we addressed them, and discuss the strengths and limitations of adopting the OMOP CDM for breast cancer research. The OMOP-mapped BCNB-BCI dataset is a valuable resource that can now be explored and analysed alongside other health datasets.
Gainullin, V. G.; Gray, M.; Kumar, M.; Luebker, S.; Lehman, A. M.; Choudhry, O. A.; Roberta, J.; Flake, D. D.; Shanmugam, A.; Cortes, K.; Chang, E.; Uren, P. J.; Mazloom, A.; Garces, J.; Silvestri, G. A.; Chesla, D. W.; Given, R. W.; Beer, T. M.; Diehl, F.
Multi-cancer early detection (MCED) tests can detect several cancer types and stages. We previously developed a methylation and protein (MP V1) MCED classifier. In this study, we present a refined MP V2 classifier, developed by evaluating model architectures that improved performance in prospectively enrolled case-control cohorts under standard testing conditions. The newly developed MP V2 classifier was trained to be more generalizable and achieve increased early-stage sensitivity at a target specificity of ≥97.0%. MP V1 and MP V2 classifier performances were compared using a previously described test set, and MP V2 performance was also evaluated in a new independent clinical validation set. Compared to MP V1, the MP V2 classifier demonstrated a 7.3% increase in overall sensitivity, with sensitivity increases of 7.6%, 9.2%, and 8.3% for stages I, II, and stages I/II, respectively, in the intended use (breast and prostate cancers excluded) test set. In an independent validation intended use set, the MP V2 classifier showed an overall sensitivity of 55.6%, with sensitivities of 26.8%, 42.9%, and 34.8% for stages I, II, and stages I/II, respectively. In a case-control setting, the MP V2 classifier offered improved sensitivity for early-stage cancers at a lower specificity target.
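Operating at a target specificity, as this classifier does, amounts to choosing a score cutoff on the negative (non-cancer) score distribution and reading off the resulting sensitivity on the positives. A toy sketch with synthetic scores (not MP V2 output), using the abstract's ≥97.0% target:

```python
# Hedged sketch: pick the lowest cutoff whose specificity meets the target,
# then report sensitivity at that cutoff. All scores below are synthetic.
def sensitivity_at_specificity(pos_scores, neg_scores, target_spec=0.97):
    """Return (cutoff, sensitivity, specificity) at the first cutoff meeting target_spec."""
    for t in sorted(set(pos_scores + neg_scores)):
        spec = sum(s < t for s in neg_scores) / len(neg_scores)  # negatives called negative
        if spec >= target_spec:
            sens = sum(s >= t for s in pos_scores) / len(pos_scores)  # positives called positive
            return t, sens, spec
    return None

neg = [i / 100 for i in range(100)]  # 100 synthetic non-cancer scores, 0.00 to 0.99
pos = [0.5, 0.95, 0.98, 0.99]        # 4 synthetic cancer scores
print(sensitivity_at_specificity(pos, neg))
```

Raising the specificity target pushes the cutoff up and sensitivity down, which is the trade-off underlying the stage-specific sensitivities reported above.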
Makani, A.; Agrawal, A.; Agrawal, A.
Medical image segmentation remains a critical bottleneck in clinical workflows, from diagnostic radiology to radiation oncology treatment planning. We present Onco-Seg, a medical imaging adaptation of Meta's Segment Anything Model 3 (SAM3) that leverages promptable concept segmentation for automated tumor and organ delineation across multiple imaging modalities. Unlike previous SAM adaptations limited to single modalities, Onco-Seg introduces a unified framework supporting CT, MRI, ultrasound, dermoscopy, and endoscopy through modality-specific preprocessing and parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA). We train on 35 datasets comprising over 98,000 cases across 8 imaging modalities using sequential checkpoint chaining on a 4-GPU distributed training infrastructure. We evaluate Onco-Seg on 12 benchmark datasets spanning breast, liver, prostate, lung, skin, and gastrointestinal pathologies, achieving strong performance on breast ultrasound (Dice: 0.752±0.24), polyp segmentation (Dice: 0.714±0.32), and liver CT (Dice: 0.641±0.12). We further propose two clinical deployment patterns: an interactive "sidecar" for diagnostic radiology and a "silent assistant" for automated radiation oncology contouring. We release an open-source napari plugin enabling interactive segmentation with DICOM-RT export for radiation oncology workflows. Code and models are available at https://github.com/inventcures/onco-segment.
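The Dice scores reported above measure overlap between a predicted mask and the reference mask. A minimal sketch on toy binary masks (not model output):

```python
# Dice similarity coefficient for binary segmentation masks,
# here represented as flat lists of 0/1; values are illustrative.
def dice(pred, truth):
    inter = sum(p and t for p, t in zip(pred, truth))  # voxels marked 1 in both masks
    total = sum(pred) + sum(truth)
    return 2 * inter / total if total else 1.0  # convention: two empty masks agree

pred = [1, 1, 0, 1, 0, 0]
truth = [1, 0, 0, 1, 1, 0]
print(dice(pred, truth))  # 2*2 / (3+3)
```

Dice ranges from 0 (no overlap) to 1 (perfect overlap), so a mean of 0.752 on breast ultrasound indicates substantial but imperfect agreement with reference contours.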
Windisch, P.; Weyrich, J.; Dennstaedt, F.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.
Purpose: Large language models (LLMs) are used for biomedical text processing, but individual decisions are often hard to audit. We evaluated whether enforcing a mechanically checkable "show your work" quote affects accuracy, stability, and verifiability for trial eligibility-scope classification from abstracts. Methods: We used 200 oncology randomized controlled trials (2005-2023) and provided models with only the title and abstract. Trials were labeled with whether they allowed for the inclusion of patients with localized and/or metastatic disease. Three flagship models (GPT-5.2, Gemini 3 Flash, Claude Opus 4.5) were queried with default settings in two independent conditions: label-only and label plus a verbatim supporting quote. Models could abstain if they deemed the abstract not to contain sufficient information. Each condition was repeated three times per abstract. Quotes were mechanically validated as exact substrings after whitespace normalization, and a separate judge step used an LLM to rate whether each quote supported the assigned label. Results: Evidence requirements modestly reduced coverage (GPT-5.2 86.2% to 84.3%, Gemini 98.3% to 92.8%, Claude 96.0% to 94.5%) by increasing abstentions and, for Gemini, invalid outputs. Conditional macro-F1 remained high but varied by model (slight gains for GPT-5.2 and Gemini, a decrease for Claude). Labels were stable across repetitions (Fleiss kappa 0.829 to 0.969). Mechanically valid quotes occurred in 83.3% to 91.2% of runs, yet only 48.0% to 78.8% of evidence-bearing predictions were judged semantically supported. Restricting to supported predictions increased macro-F1 at the cost of lower coverage. Conclusion: Substring-verifiable quotes provide an automated audit trail and enable selective, higher-trust automation when applying LLMs to biomedical text processing. However, this approach introduces new failure modes and trades coverage for verifiability in a model-dependent way.
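The mechanical quote check described here, exact-substring matching after whitespace normalization, can be sketched as follows; the abstract text and quotes are invented examples:

```python
# Sketch of substring-based quote validation: a quote is mechanically valid
# if, after collapsing whitespace, it appears verbatim in the source abstract.
import re

def normalize_ws(text):
    """Collapse runs of whitespace (spaces, newlines, tabs) to single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def quote_is_valid(quote, abstract):
    return normalize_ws(quote) in normalize_ws(abstract)

abstract = "Patients with localized\n  or metastatic disease were eligible."
print(quote_is_valid("localized or metastatic disease", abstract))  # True
print(quote_is_valid("only localized disease", abstract))           # False
```

As the abstract notes, passing this check guarantees only that the quote exists, not that it semantically supports the assigned label; that gap is why the study adds a separate LLM judge step.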
Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.
Purpose: Large language models (LLMs) can classify biomedical documents accurately, but strong performance does not prove they are using the supplied text rather than identifier-triggered parametric knowledge. We tested whether oncology trial-success classification reflects "reading" of abstract evidence or "remembering" of known trials. Methods: We used a corpus of 250 two-arm oncology randomized controlled trials from seven major journals (2005-2023) and asked the flagship models of three commercial vendors (OpenAI, Google, and Anthropic) to output a single label indicating whether the primary endpoint was met. For each trial we created five deterministic inputs: title+abstract (baseline), title-only, DOI-only, a counterfactual title+abstract with the primary endpoint outcome minimally flipped, and the same counterfactual title+abstract paired with the original DOI to induce an identifier-text conflict. Results: With the full title+abstract, models achieved near-ceiling performance (accuracy and F1 score 0.96-0.97) and high format adherence (97.2-100%). Performance degraded stepwise with content removal (title-only accuracy and F1 score 0.79-0.88, DOI-only 0.63-0.67), consistent with an above-chance identifier-driven signal. Under counterfactual results, models followed the edited evidence (accuracy and F1 score 0.96-0.99 against inverted labels). Adding the real DOI minimally affected GPT (accuracy and F1 score ≈0.99) but modestly reduced Gemini (accuracy and F1 score ≈0.97) and Claude (accuracy and F1 score ≈0.95), mainly via lower sensitivity. Conclusion: LLMs robustly track explicit endpoint statements in abstracts, yet identifiers can support above-chance predictions and occasionally compete with textual evidence. Progressive ablations plus counterfactual conflicts provide a practical, reproducible audit for grounding in biomedical LLM evaluations.